Introduction to Open Data Science - Course Project

About the project

Write a short description about the course and add a link to your GitHub repository here. This is an R Markdown (.Rmd) file so you should use R Markdown syntax.

How are you feeling right now?

I am excited to have enrolled in Open Data Science.

What do you expect to learn?

I look forward to learning

  • Data analysis
  • Prediction

Where did you hear about the course?

I heard about it in 1905 through an email from DONASCI

My github link

# This is a so-called "R chunk" where you can write R code.

date()
## [1] "Sun Nov 28 15:01:34 2021"

The text continues here.


Regression and model validation

Describe the work you have done this week and summarize your learning.

Introduction to the data

In this work space, student feedback data (more information here) collected in 2014-2015 were used to assess the relationship between learning approaches and student achievement in an introductory statistics course. The dataset underwent data wrangling and will be used to fit a linear regression model and interpret the results.

1. Load the data into your R workspace from your working directory

The first step is to set the working directory where we have stored our data

#set the working directory of your R session the IODS project folder
setwd("C:/LocalData/ocholla/IODS-project")

We load the dataset saved in the data wrangling exercise. We read the file with header = TRUE and stringsAsFactors = TRUE to ensure that the factor variables are read as factors rather than character strings.

students2014<-read.csv("Data/learning2014.csv", header = TRUE, stringsAsFactors = TRUE)

2. Data exploration and analysis

We explore the data structure and its dimensions

str(students2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.5 3.1 3.5 3.5 3.7 4.7 3.7 3.1 4.3 4 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...
dim(students2014)
## [1] 166   7

The data contains 166 observations of 7 variables. The observations refer to the students, and the variables are gender (male or female), age (derived from the date of birth), exam points, attitude, deep learning (deep), strategic learning (stra) and surface learning (surf).

Summaries of the variables in the data

summary(students2014)
##  gender       age           attitude          deep            stra      
##  F:110   Min.   :17.00   Min.   :1.400   Min.   :1.600   Min.   :1.250  
##  M: 56   1st Qu.:21.00   1st Qu.:2.600   1st Qu.:3.300   1st Qu.:2.625  
##          Median :22.00   Median :3.200   Median :3.700   Median :3.188  
##          Mean   :25.51   Mean   :3.143   Mean   :3.666   Mean   :3.121  
##          3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:4.100   3rd Qu.:3.625  
##          Max.   :55.00   Max.   :5.000   Max.   :4.900   Max.   :5.000  
##       surf           points     
##  Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.417   1st Qu.:19.00  
##  Median :2.833   Median :23.00  
##  Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :4.333   Max.   :33.00

The mean age of the students in the class is 25.5 years, with the youngest being 17 and the oldest 55. The attitude variable is on a 1-5 scale, where 5 indicates a good attitude.
Female students (110) were almost twice as numerous as male students (56), and deep learning had the highest mean value among the three learning-approach variables. The lowest exam score attained was 7, with a class average of about 22.7.

Visualizing student attitude against exam points

library(ggplot2)
#initialize plot with data and aesthetic mapping
p1<- ggplot(students2014, aes( x= attitude, y= points, col = gender))
#define the visualization type (points)
p2<- p1 + geom_point()
#add a regression line
p3<- p2+geom_smooth(method = "lm")
#add a main title and draw the plot
p4<- p3 + ggtitle("student's attitude versus exam points")
p4
## `geom_smooth()` using formula 'y ~ x'

From the plot above, we can deduce that the majority of students of both genders had an average attitude, which coincided with average exam points. Female students with a negative attitude towards statistics nevertheless attained higher exam points than male students with a negative attitude.

An advanced plot matrix with ggpairs()

#Access the GGally and ggplot2 libraries
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
#create a more advanced plot matrix with ggpairs()
p<- ggpairs(students2014, mapping = aes (col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
#draw the plot
p

From the scatter plot matrix, for both genders the attitude of the student is highly correlated with the exam points attained, while male students are more likely to score low on deep learning than female students.

3. Regression model

A linear regression model is a statistical approach for modelling the relationship between a dependent variable Y and one or more explanatory variables X. If there is only one explanatory variable it is called simple linear regression; with more than one it is referred to as multiple linear regression.
A linear model consists of two parts, a systematic part and a stochastic part, and follows the equation \[Y = \alpha + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k + \epsilon \]

  • Y is the target variable; we wish to predict the values of Y using the values of X
  • \(\alpha + \beta_1 X_1 + \dots + \beta_k X_k\) is the systematic part of the model
  • \(\beta_i\) quantifies the relationship between Y and \(X_i\).
  • \(\epsilon\) describes the errors, which are assumed to be normally distributed.

Using the student data set, exam points is the target variable, while attitude, strategic learning (stra) and age are the explanatory variables in the model. The three predictors were selected based on their correlation with the response variable.

my_model<- lm(points~ attitude+stra+age, data = students2014)
summary(my_model)
## 
## Call:
## lm(formula = points ~ attitude + stra + age, data = students2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.1149  -3.2003   0.3303   3.4129  10.7599 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.89543    2.64834   4.114 6.17e-05 ***
## attitude     3.48077    0.56220   6.191 4.72e-09 ***
## stra         1.00371    0.53434   1.878   0.0621 .  
## age         -0.08822    0.05302  -1.664   0.0981 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.26 on 162 degrees of freedom
## Multiple R-squared:  0.2182, Adjusted R-squared:  0.2037 
## F-statistic: 15.07 on 3 and 162 DF,  p-value: 1.07e-08

The call formula captures the model fitted; in this case it is a multiple linear regression with points as the target variable and attitude, strategic learning (stra) and age as the explanatory variables.

Residuals
Residuals are the differences between the actual observed response values (exam points) and the response values that the model predicted.
The residual section gives a five-number summary, used to assess how symmetrically the residuals are distributed around a mean of zero. From the output, the distribution of the residuals appears reasonably symmetrical.
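This definition can be checked directly in R (a quick sketch, assuming the my_model and students2014 objects from above):

```r
# residuals = observed response minus fitted values
res <- students2014$points - fitted(my_model)
all.equal(unname(res), unname(residuals(my_model)))  # TRUE for any lm fit, by definition
# the five-number summary of the residuals, as shown in the model summary
summary(residuals(my_model))
```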

Coefficients
The coefficients are the unknown constants of the model: the intercept and the parameters of the predictors. The coefficient section has four columns.

  1. Estimate The Estimate column contains four rows; the first one is the intercept \(\alpha\), while the rest are estimates of the unknown parameters \(\beta_1, \beta_2, \dots, \beta_k\) of our multiple linear regression model.

  2. Std. Error
    The coefficient standard error measures the average amount that the coefficient estimate varies from its actual value. Ideally, this value should be small relative to its coefficient.

  3. t value The t value measures how many standard errors the coefficient estimate is away from 0. A t value far from zero indicates that we can reject the null hypothesis, i.e. that a relationship between the explanatory variable and the target variable exists. In our analysis, the t value of attitude is far from zero and large relative to its standard error, indicating a relationship, whereas the evidence for stra and age is weaker.
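As a quick sketch (assuming my_model from above), the t value column can be reproduced by dividing each estimate by its standard error:

```r
# the coefficient table of the fitted model: Estimate, Std. Error, t value, Pr(>|t|)
cs <- coef(summary(my_model))
cs[, "Estimate"] / cs[, "Std. Error"]  # reproduces the "t value" column
```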

  4. Pr(>|t|) Commonly referred to as the p-value, a small p-value indicates that it is unlikely that the observed relationship between the predictor and the target variable is due to chance. Typically, a p-value of 5% or less is a good cut-off point. In our model, the p-values for the intercept and attitude are much closer to zero than those of stra and age.

The "Signif. codes" are associated with each estimate; three stars (asterisks) represent a highly significant p-value. In this model only attitude (and the intercept) is significant at the 5% level (p \(\leqslant\) 0.05), so we can reject the null hypothesis for attitude and conclude that there is a relationship between exam points and attitude; stra and age are significant only at the 10% level.

Residual standard error is a measure of the quality of a linear regression fit: it is the average amount that the response (exam points) deviates from the true regression line. In my_model the response deviates from the regression line by approximately 5.26 points on average. Note that the residual standard error was calculated with 162 degrees of freedom (df), the number of observations that went into the estimation minus the number of parameters estimated. In this analysis, we had 166 observations (students) and four parameters (the intercept and three predictors), giving 166 − 4 = 162 degrees of freedom.

Multiple R-squared
The R-squared, also known as the coefficient of determination (\(R^2\)), gives the percentage of the variation in the dependent/response/target variable that the explanatory variables explain. In other words, it describes how well the model fits the data. \(R^2\) always lies between 0 and 1: values near 0 indicate a regression that explains little of the variance in the response variable, while values near 1 indicate that most of the variance is explained. In my_model, \(R^2\) is 0.2182, meaning that only about 22% of the variance in the response variable (exam points) is explained by the predictors attitude, stra and age. Adjusted R-squared reports the same quantity adjusted for the number of predictors in the model.
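As a sketch of this definition (assuming my_model and students2014 from above), \(R^2\) can be computed by hand from the residual and total sums of squares:

```r
rss <- sum(residuals(my_model)^2)                                # residual sum of squares
tss <- sum((students2014$points - mean(students2014$points))^2)  # total sum of squares
1 - rss / tss                    # multiple R-squared, approximately 0.218
summary(my_model)$r.squared      # the same value stored in the summary object
```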

4. Diagnostic plots

Three diagnostic plots were plotted from the model

  1. Residuals vs Fitted
    This plot explores the validity of the model assumptions that are included in the expression \[ \epsilon \sim N(0,\sigma^2) \]
  • The errors are normally distributed
  • The errors are not correlated
  • The errors have constant variance \(\sigma^2\)
  • The size of a given error does not depend on the explanatory variables
  2. Normal Q-Q plot
    Explores the assumption that the errors of the model are normally distributed. When the majority of the residuals fall along the line, the assumption of normality is supported; severe deviation of the residuals from the line makes the assumption questionable.

  3. Residuals vs Leverage plot
    Leverage measures how much impact or influence a single observation has on the model. Not all outliers are influential in regression analysis. Cases with high influence are usually located in the upper right or lower right corner of the plot, where they can pull the regression line towards them. Cases outside the Cook's distance lines (i.e. with high Cook's distance scores) are influential to the regression results, and the results would change if they were excluded.
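Cook's distances can also be inspected numerically rather than visually (a quick sketch, assuming my_model from above):

```r
cd <- cooks.distance(my_model)
head(sort(cd, decreasing = TRUE), 3)  # the three most influential observations
any(cd > 1)                           # a common rule of thumb: distances above 1 are worrying
```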

par(mfrow= c(2,2))
plot(my_model, which = c(1,2,5))

In the Residuals vs Fitted plot, the residuals are scattered without a distinctive pattern, supporting the assumptions of linearity and constant error variance.

In the Normal Q-Q plot, most of the residual points fall along the diagonal line, with a little deviation at both ends. Since the deviation of the residuals from the line is minimal, the Q-Q plot supports the assumption of normality of the model.

In the leverage plot there are no influential cases: all of the cases are well inside the Cook's distance lines.

date()
## [1] "Sun Nov 28 15:01:42 2021"

References
1. Rmarkdown for Scientist
2. University of Virginia Library


LOGISTIC REGRESSION ANALYSIS

In this exercise, we explore the use of logistic regression on student achievement data from secondary education in two Portuguese schools. The data attributes include student grades and demographic, social and school-related features, collected using school reports and questionnaires. More information can be found in the link.

The purpose of the analysis is to study the relationship between high/low alcohol consumption and student attributes. The joined dataset used in this exercise combines the two student alcohol consumption data sets (the mathematics and Portuguese classes); the combined alcohol-use variables (alc_use, high_use) were created during data wrangling.

1. Load and describe the student alcohol consumption data

#set the work directory
setwd("C:/LocalData/ocholla/IODS-project")
#Load the data
alc<- read.csv("Data/pormath.csv", header = TRUE, stringsAsFactors = TRUE)
#print column names
colnames(alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "guardian"   "traveltime" "studytime"  "schoolsup" 
## [16] "famsup"     "activities" "nursery"    "higher"     "internet"  
## [21] "romantic"   "famrel"     "freetime"   "goout"      "Dalc"      
## [26] "Walc"       "health"     "n"          "id.p"       "id.m"      
## [31] "failures"   "paid"       "absences"   "G1"         "G2"        
## [36] "G3"         "failures.p" "paid.p"     "absences.p" "G1.p"      
## [41] "G2.p"       "G3.p"       "failures.m" "paid.m"     "absences.m"
## [46] "G1.m"       "G2.m"       "G3.m"       "alc_use"    "high_use"  
## [51] "cid"
#structure of the data
str(alc)
## 'data.frame':    370 obs. of  51 variables:
##  $ school    : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sex       : Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
##  $ age       : int  15 15 15 15 15 15 15 15 15 15 ...
##  $ address   : Factor w/ 2 levels "R","U": 1 1 1 1 1 1 1 1 1 2 ...
##  $ famsize   : Factor w/ 2 levels "GT3","LE3": 1 1 1 1 1 1 1 2 2 1 ...
##  $ Pstatus   : Factor w/ 2 levels "A","T": 2 2 2 2 2 2 2 2 2 1 ...
##  $ Medu      : int  1 1 2 2 3 3 3 2 3 3 ...
##  $ Fedu      : int  1 1 2 4 3 4 4 2 1 3 ...
##  $ Mjob      : Factor w/ 5 levels "at_home","health",..: 1 3 1 4 4 4 4 2 3 3 ...
##  $ Fjob      : Factor w/ 5 levels "at_home","health",..: 3 3 3 2 4 2 5 4 3 2 ...
##  $ reason    : Factor w/ 4 levels "course","home",..: 2 4 4 1 4 1 1 4 4 4 ...
##  $ guardian  : Factor w/ 3 levels "father","mother",..: 2 2 2 2 3 2 1 2 1 1 ...
##  $ traveltime: int  2 1 1 1 2 1 2 2 2 1 ...
##  $ studytime : int  4 2 1 3 3 3 3 2 4 4 ...
##  $ schoolsup : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 1 2 1 2 ...
##  $ famsup    : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 1 ...
##  $ activities: Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 1 1 1 1 ...
##  $ nursery   : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 1 2 ...
##  $ higher    : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ internet  : Factor w/ 2 levels "no","yes": 2 2 1 2 2 2 2 2 2 1 ...
##  $ romantic  : Factor w/ 2 levels "no","yes": 1 2 1 1 2 1 2 1 1 1 ...
##  $ famrel    : int  3 3 4 4 4 4 4 4 4 4 ...
##  $ freetime  : int  1 3 3 3 2 3 2 1 4 3 ...
##  $ goout     : int  2 4 1 2 1 2 2 3 2 3 ...
##  $ Dalc      : int  1 2 1 1 2 1 2 1 2 1 ...
##  $ Walc      : int  1 4 1 1 3 1 2 3 3 1 ...
##  $ health    : int  1 5 2 5 3 5 5 4 3 4 ...
##  $ n         : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ id.p      : int  1096 1073 1040 1025 1166 1039 1131 1069 1070 1106 ...
##  $ id.m      : int  2096 2073 2040 2025 2153 2039 2131 2069 2070 2106 ...
##  $ failures  : int  0 1 0 0 1 0 1 0 0 0 ...
##  $ paid      : Factor w/ 2 levels "no","yes": 2 1 1 1 2 1 1 1 1 1 ...
##  $ absences  : int  3 2 8 2 5 2 0 1 9 10 ...
##  $ G1        : int  10 10 14 10 12 12 11 10 16 10 ...
##  $ G2        : int  12 8 13 10 12 12 6 10 16 10 ...
##  $ G3        : int  12 8 12 9 12 12 6 10 16 10 ...
##  $ failures.p: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ paid.p    : Factor w/ 2 levels "no","yes": 2 1 1 1 2 1 1 1 1 1 ...
##  $ absences.p: int  4 2 8 2 2 2 0 0 6 10 ...
##  $ G1.p      : int  13 13 14 10 13 11 10 11 15 10 ...
##  $ G2.p      : int  13 11 13 11 13 12 11 10 15 10 ...
##  $ G3.p      : int  13 11 12 10 13 12 12 11 15 10 ...
##  $ failures.m: int  1 2 0 0 2 0 2 0 0 0 ...
##  $ paid.m    : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 1 2 1 1 ...
##  $ absences.m: int  2 2 8 2 8 2 0 2 12 10 ...
##  $ G1.m      : int  7 8 14 10 10 12 12 8 16 10 ...
##  $ G2.m      : int  10 6 13 9 10 12 0 9 16 11 ...
##  $ G3.m      : int  10 5 13 8 10 11 0 8 16 11 ...
##  $ alc_use   : num  1 3 1 1 2.5 1 2 2 2.5 1 ...
##  $ high_use  : logi  FALSE TRUE FALSE FALSE TRUE FALSE ...
##  $ cid       : int  3001 3002 3003 3004 3005 3006 3007 3008 3009 3010 ...

The dataset contains 370 observations of 51 variables; the variables are a mixture of integer, factor and logical data types.

2. Analysis of the relationship between student variables and alcohol consumption

Hypothesis on the four variables

  1. sex - male students consume more alcohol than female students
  2. studytime - low study time increases the chances of high alcohol use
  3. famrel - good quality family relationships reduce alcohol abuse
  4. goout - students who go out with friends frequently are prone to peer pressure, which can easily lead to high alcohol consumption

Distribution of alcohol consumption by gender

library(ggplot2)
g1<- ggplot(data= alc, aes(x = high_use, fill= sex))
#define the plot as a bar plot and draw it
g1+ geom_bar()+ ggtitle("Alcohol consumption across students by gender")

A higher proportion of male students engage in high alcohol consumption than female students, whereas more female students are in the low-consumption group than male students.

Relationship between sex, quality family relationships and use of alcohol

#*summary statistics by sex and quality family relationships 
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)

#produce summary statistics by group
alc%>% group_by(sex, high_use)%>% summarise(count= n(), familytime=mean(famrel))
## `summarise()` has grouped output by 'sex'. You can override using the `.groups` argument.
## # A tibble: 4 x 4
## # Groups:   sex [2]
##   sex   high_use count familytime
##   <fct> <lgl>    <int>      <dbl>
## 1 F     FALSE      154       3.92
## 2 F     TRUE        41       3.73
## 3 M     FALSE      105       4.13
## 4 M     TRUE        70       3.79

About four-fifths of the female students report low alcohol consumption, and they have a higher mean family relationship quality than the female students with high consumption. Similarly, male students with better family relationships tend to have low alcohol consumption.

Relationship between sex, going out and alcohol consumption among the students

#summary statistics of going out, by sex and alcohol use
alc%>% group_by(sex, high_use)%>% summarise(count= n(), going_out=mean(goout))
## `summarise()` has grouped output by 'sex'. You can override using the `.groups` argument.
## # A tibble: 4 x 4
## # Groups:   sex [2]
##   sex   high_use count going_out
##   <fct> <lgl>    <int>     <dbl>
## 1 F     FALSE      154      2.95
## 2 F     TRUE        41      3.39
## 3 M     FALSE      105      2.70
## 4 M     TRUE        70      3.93

In both genders, higher incidences of going out are associated with high alcohol consumption.

Does high use of alcohol have a connection to study time?

#does high use of alcohol have a connection to study time?
a2<- ggplot(alc, aes(x= high_use, y = studytime, col= sex))
#define the plot as a boxplot and draw it
a2 + geom_boxplot()+ylab("Hours")+ggtitle("Student study hours by alcohol consumption and sex")

Among students with low alcohol consumption, female students spent between 5 and 10 hours on their studies, compared to male students who spent between 2 and 5 hours.

On the contrary, female students with high alcohol consumption spent around 5 hours on their studies, while the male students spent 2 to 5 hours. (Note that studytime is a 1-4 category variable, where e.g. 2 corresponds to 2-5 hours per week.)

3. Fitting a logistic model

#find the model with glm()
m<- glm(high_use~sex+studytime+famrel+goout, data = alc, family = "binomial")
#print out a summary of the model
summary(m)
## 
## Call:
## glm(formula = high_use ~ sex + studytime + famrel + goout, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5891  -0.7629  -0.4976   0.8304   2.6937  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -1.2672     0.7312  -1.733  0.08309 .  
## sexM          0.7959     0.2669   2.982  0.00286 ** 
## studytime    -0.4814     0.1711  -2.814  0.00490 ** 
## famrel       -0.4193     0.1399  -2.996  0.00273 ** 
## goout         0.7873     0.1230   6.399 1.57e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 370.69  on 365  degrees of freedom
## AIC: 380.69
## 
## Number of Fisher Scoring iterations: 4

From the model, the p-values for the four variables are statistically significant. The output supports the hypothesis that gender, study time, going out and quality of family relationships affect students' alcohol consumption. The positive estimate for the factor variable sex indicates a higher likelihood of high alcohol use among male students relative to female students (the reference level).

Model coefficients

#Print out the coefficients of the model
coef(m)
## (Intercept)        sexM   studytime      famrel       goout 
##  -1.2672175   0.7959494  -0.4813712  -0.4192830   0.7872896

Based on the regression coefficients, the odds of high alcohol use increase for male students and for those who go out frequently, while they decrease for students who allocate more study time and have good family relationships.

Odds ratio
The odds of an event are the ratio of expected "successes" to "failures"; an odds ratio compares the odds between two levels of a predictor. Higher odds correspond to a higher probability of success, with values ranging from zero to infinity.

The coefficients are transformed to odds ratios by exponentiating them. Each odds ratio gives the multiplicative change in the odds of high_use for a one-unit change in the corresponding predictor, holding all other predictors constant.

#compute odds ratios (OR)
OR<- coef(m)%>% exp
#compute confidence intervals (CI)
CI<- confint(m)%>% exp
## Waiting for profiling to be done...
#Print out the odds ratios with their CI
cbind(OR,CI)
##                    OR      2.5 %    97.5 %
## (Intercept) 0.2816141 0.06545486 1.1622100
## sexM        2.2165443 1.31841735 3.7630180
## studytime   0.6179355 0.43751933 0.8576353
## famrel      0.6575181 0.49791884 0.8636137
## goout       2.1974324 1.73873662 2.8198312

With 95% confidence intervals, the odds of high alcohol use are multiplied by about 2.22 for male students (relative to female students) and by about 2.20 for each one-unit increase in going out. Conversely, the odds of high alcohol use drop by about 38.2% for every one-unit increase in study time and by about 34.2% for every one-unit improvement in family relationship quality.
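The percentage changes quoted above can be computed directly from the coefficients (a quick sketch, assuming the model m from above):

```r
# percentage change in the odds of high_use per one-unit increase in each predictor
(exp(coef(m)) - 1) * 100
# e.g. studytime: (0.618 - 1) * 100 ≈ -38.2%; famrel: (0.658 - 1) * 100 ≈ -34.2%
```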

Predictive power of the model
We explore the predictive power of the logistic model through a 2x2 cross-tabulation of the predictions versus the actual values.

#predict() the probability of high_use
probabilities <- predict(m, type = "response")
#add the predicted probabilities to alc
alc<- mutate(alc, probability= probabilities)
#use the probabilities to make a prediction of high_use
alc<- mutate(alc, prediction= probability>0.5)
#*see the first five original classes, predicted probabilities, 
select(alc, goout, famrel,studytime, sex, high_use,probability, prediction)%>% head(5)
##   goout famrel studytime sex high_use probability prediction
## 1     2      3         4   F    FALSE  0.05335420      FALSE
## 2     4      3         2   F     TRUE  0.41613728      FALSE
## 3     1      4         1   F    FALSE  0.06670564      FALSE
## 4     2      4         3   F    FALSE  0.05657850      FALSE
## 5     1      4         3   F     TRUE  0.02656662      FALSE

In these first five cases (all female students), the model predicts low consumption throughout, including for the two students who actually reported high use.

Confusion matrix (2x2) table

#tabulate the target variable versus the predictions 

table(high_use=alc$high_use, prediction=alc$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   230   29
##    TRUE     59   52

From the table, 230 low-consumption observations were correctly predicted while 29 were incorrectly predicted as high use. On the other hand, 52 high-consumption observations were correctly predicted while 59 were missed.

Plotting the original high_use values versus the predicted probabilities in alc

#initialize a plot of 'high_use' versus 'probability' in alc
g<- ggplot(alc, aes(x = probability, y= high_use, col = prediction))
#define the geom as points and draw the plot
g+geom_point()

Accuracy assessment through the confusion matrix

#tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)%>% prop.table%>% addmargins
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.62162162 0.07837838 0.70000000
##    TRUE  0.15945946 0.14054054 0.30000000
##    Sum   0.78108108 0.21891892 1.00000000

From the proportion table, about 8% of the observations are false positives (low use predicted as high) and about 16% are false negatives (high use predicted as low), giving a total error of roughly 24%.
From the confusion matrix table, the accuracy of the prediction is the sum of the true negatives and true positives (0.62 + 0.14) divided by the total of all cells. The prediction accuracy is therefore about 0.76, which is a clear improvement over simple guessing strategies.
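The accuracy calculation described above can be written out explicitly (a quick sketch, assuming the alc$prediction column created above):

```r
tab <- table(high_use = alc$high_use, prediction = alc$prediction)
accuracy <- sum(diag(tab)) / sum(tab)   # (230 + 52) / 370, approximately 0.76
training_error <- 1 - accuracy          # approximately 0.24
```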

4. Cross validation

Cross-validation is a method of testing a predictive model on unseen data. The data are split into training and testing sets: the training set is used to fit the model, while the test set is used to make predictions and evaluate model performance.

One round of cross validation involves

  1. partitioning a sample of data into complementary subsets
  2. Performing the analysis on one subset (the training set, larger)
  3. validating the analysis on the other subset (the testing set, smaller)

This process is repeated so that eventually all of the data are used for both training and testing. In CV, the value of a penalty (loss) function (mean prediction error) is computed on data not used for fitting the model. A low CV value indicates a good model.

Cross validation using 10 K-folds

library(boot)
#define a loss function (average prediction error)
loss_func<- function(class, prob){
  n_wrong<-abs(class - prob)> 0.5
  mean(n_wrong)
}
#proportion of wrong predictions (training error)
loss_func(class = alc$high_use, prob = alc$probability) 
## [1] 0.2378378
cv<- cv.glm(data= alc, cost= loss_func, glmfit= m, K=10)
#proportion of wrong predictions in the cross-validation
cv$delta[1]
## [1] 0.2459459

The model had a 10-fold cross-validation error of about 0.25 (training error about 0.24), which is lower than the error of the model used in the DataCamp exercise.

Super bonus: perform cross-validation to compare the performance of different logistic regression models (= different sets of predictors). Start with a high number of predictors and explore the changes in the training and testing errors as you move to models with fewer predictors. Draw a graph displaying the trends of both training and testing errors by the number of predictors in the model.

#find the model with glm()
summary(m0<- glm(high_use~G3+absences+traveltime+reason+famsize+sex+studytime+famrel+goout, data = alc, family = "binomial"))
## 
## Call:
## glm(formula = high_use ~ G3 + absences + traveltime + reason + 
##     famsize + sex + studytime + famrel + goout, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8894  -0.7459  -0.4418   0.6426   2.7811  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -2.66560    1.00389  -2.655  0.00792 ** 
## G3               -0.01447    0.04251  -0.340  0.73350    
## absences          0.07678    0.02339   3.282  0.00103 ** 
## traveltime        0.40320    0.19364   2.082  0.03732 *  
## reasonhome        0.34770    0.34200   1.017  0.30930    
## reasonother       1.11530    0.46975   2.374  0.01759 *  
## reasonreputation -0.16462    0.36307  -0.453  0.65025    
## famsizeLE3        0.23888    0.29336   0.814  0.41549    
## sexM              0.87539    0.28243   3.100  0.00194 ** 
## studytime        -0.30531    0.18219  -1.676  0.09379 .  
## famrel           -0.42737    0.14738  -2.900  0.00373 ** 
## goout             0.78725    0.13037   6.039 1.55e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 346.82  on 358  degrees of freedom
## AIC: 370.82
## 
## Number of Fisher Scoring iterations: 5
summary(m1<- glm(high_use~absences+traveltime+reason+sex+studytime+famrel+goout, data = alc, family = "binomial"))
## 
## Call:
## glm(formula = high_use ~ absences + traveltime + reason + sex + 
##     studytime + famrel + goout, family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9077  -0.7427  -0.4506   0.6514   2.7420  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -2.74054    0.85222  -3.216 0.001301 ** 
## absences          0.07805    0.02326   3.356 0.000791 ***
## traveltime        0.41779    0.19046   2.194 0.028263 *  
## reasonhome        0.32961    0.34063   0.968 0.333226    
## reasonother       1.09017    0.46850   2.327 0.019970 *  
## reasonreputation -0.18187    0.36148  -0.503 0.614869    
## sexM              0.89451    0.28156   3.177 0.001488 ** 
## studytime        -0.32030    0.18031  -1.776 0.075671 .  
## famrel           -0.43481    0.14668  -2.964 0.003033 ** 
## goout             0.79177    0.12892   6.142 8.16e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 347.54  on 360  degrees of freedom
## AIC: 367.54
## 
## Number of Fisher Scoring iterations: 5
summary(m2<- glm(high_use~absences+traveltime+sex+studytime+famrel+goout, data = alc, family = "binomial"))
## 
## Call:
## glm(formula = high_use ~ absences + traveltime + sex + studytime + 
##     famrel + goout, family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7951  -0.7373  -0.4716   0.6918   2.5876  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.35248    0.80844  -2.910  0.00362 ** 
## absences     0.07725    0.02257   3.423  0.00062 ***
## traveltime   0.37496    0.18585   2.018  0.04364 *  
## sexM         0.89229    0.27716   3.219  0.00128 ** 
## studytime   -0.38722    0.17561  -2.205  0.02745 *  
## famrel      -0.41690    0.14401  -2.895  0.00379 ** 
## goout        0.76216    0.12560   6.068 1.29e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 355.05  on 363  degrees of freedom
## AIC: 369.05
## 
## Number of Fisher Scoring iterations: 4
date()
## [1] "Sun Nov 28 15:01:44 2021"

Clustering

These exercises use the Boston data set found in the MASS package.
The data describe housing values in the suburbs of Boston.

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library(corrplot)
## corrplot 0.92 loaded
library(tidyr)
library(ggplot2)

1. Data exploration

str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

The Boston data set has 506 observations and 14 variables, stored as either integer or numeric data types.

1.1 Summary statistics

summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

The summary shows the descriptive statistics of each variable in the Boston data. The variable tax has the largest range between its minimum and maximum values, while chas has the smallest.
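The claim about the ranges can be checked directly; a quick sketch using base R only:

```r
# Sketch: compute the range (max - min) of each Boston variable to confirm
# that tax varies the most and chas the least.
library(MASS)
sort(sapply(Boston, function(x) diff(range(x))))
```

Sorting the ranges puts chas (range 1) at one end and tax (range 524) at the other.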

1.2 Correlation analysis.

1.2.1 Pairwise plots between the variables
#plot matrix of the variables
pairs(Boston[1:6])

pairs(Boston[7:14])

1.2.2 Correlation matrix
#between variables in the data
cor_matrix<- cor(Boston)%>% round(digits = 2)
#print the correlation matrix
cor_matrix
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax ptratio
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58    0.29
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31   -0.39
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72    0.38
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04   -0.12
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67    0.19
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29   -0.36
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51    0.26
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53   -0.23
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91    0.46
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00    0.46
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46    1.00
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44   -0.18
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54    0.37
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47   -0.51
##         black lstat  medv
## crim    -0.39  0.46 -0.39
## zn       0.18 -0.41  0.36
## indus   -0.36  0.60 -0.48
## chas     0.05 -0.05  0.18
## nox     -0.38  0.59 -0.43
## rm       0.13 -0.61  0.70
## age     -0.27  0.60 -0.38
## dis      0.29 -0.50  0.25
## rad     -0.44  0.49 -0.38
## tax     -0.44  0.54 -0.47
## ptratio -0.18  0.37 -0.51
## black    1.00 -0.37  0.33
## lstat   -0.37  1.00 -0.74
## medv     0.33 -0.74  1.00
1.2.3 Visualization of the correlation matrix
corrplot(cor_matrix, method="circle",type= "upper",
         cl.pos="b", tl.pos="d",tl.cex=0.6)

From the correlation plot and table, the variables rad and tax are highly positively correlated (r = 0.91). This is also visible in the correlation plot, where the pair is represented by the largest circle. The pairs medv and lstat, dis and age, and dis and nox show strong negative correlations, though not as strong as the rad-tax pair.
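As a sketch, the strongest pairwise correlations can also be extracted programmatically from the matrix (assuming cor_matrix from above; the 0.7 cutoff is arbitrary):

```r
# List each variable pair with |r| > 0.7, keeping the upper triangle only
# so every pair appears exactly once.
idx <- which(abs(cor_matrix) > 0.7 & row(cor_matrix) < col(cor_matrix),
             arr.ind = TRUE)
data.frame(var1 = rownames(cor_matrix)[idx[, 1]],
           var2 = colnames(cor_matrix)[idx[, 2]],
           r    = cor_matrix[idx])
```

This returns the rad-tax, nox-dis, indus-nox, age-dis, nox-age, lstat-medv, indus-tax and indus-dis pairs seen in the plot.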

2.0 Data analysis.

2.1 Standardizing the data set

Standardizing is done through scaling: subtracting the column mean from each value in the corresponding column and dividing the difference by the column standard deviation.
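As a sanity check, the same standardization can be sketched by hand with sweep(); the result should match scale() up to floating-point error:

```r
# z-score by hand: subtract the column means, then divide by the column sds
manual <- sweep(Boston, 2, colMeans(Boston), "-")
manual <- sweep(manual, 2, sapply(Boston, sd), "/")
all.equal(as.matrix(manual), scale(Boston), check.attributes = FALSE)
```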

#* center and standardize variables
boston_scaled<- scale(Boston)
#summaries of the scaled variables
summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
  • All the variables have been standardized to a mean of zero, whereas in the original data the means and ranges varied widely (from values near 0 up to 711 for tax).

2.2 Creation of a new a categorical variable

Create a new categorical variable of the crime rate in the Boston data set from the scaled crime rate.

#class of the boston_scaled object
class(boston_scaled)
## [1] "matrix" "array"
#change the object to data frame
boston_scaled<- as.data.frame(boston_scaled)
#summary of the scaled crime rate
summary(boston_scaled$crim)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.419367 -0.410563 -0.390280  0.000000  0.007389  9.924110

The scaled crime rate ranges from -0.419 (min) to 9.924 (max), with a mean of 0.

2.3 Create break points to be used in the categorical variable

#create a quantile vector of crim and print it
bins<- quantile(boston_scaled$crim)
bins
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610

The quantile function returns the quartile break points, which divide the data into four equal-sized groups.

2.4 Cut the crime rate into categories using the break points

crime<- cut(boston_scaled$crim, breaks= bins, include.lowest= TRUE, labels=c("low","med_low","med_high","high"))

#look at the table of the new factor crime
table(crime)
## crime
##      low  med_low med_high     high 
##      127      126      126      127

The low and high crime-rate categories each had 127 observations, while med_low and med_high each had 126.

2.5 Remove existing variable and adding new variable in the dataset

#remove the old crime variable from dataset
boston_scaled<- dplyr::select(boston_scaled, -crim)
#add the new categorical value to scaled data
boston_scaled<- data.frame(boston_scaled, crime)

2.6 Creation of training and test data sets from the dataframe.

Divide the data set into a train set (80%) and a test set (20%). After the split, save the correct classes from the test data and remove the crime variable from the test set.

#number of rows in the Boston dataset
n<-nrow(boston_scaled)
#choose randomly 80% of the rows
ind<- sample(n, size=n * 0.8)
#create train set
train<- boston_scaled[ind,]
#create test set
test<- boston_scaled[-ind,]
#save the correct classes from test data
correct_classes<- test$crime
#remove the crime variable from test data
test<- dplyr::select(test, -crime)
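Note that sample() draws a different split on every knit, so results that depend on the split will vary between runs. Fixing the seed before sampling makes the split reproducible; a sketch (the seed value itself is arbitrary):

```r
# Reproducible version of the split above: set the seed first, then sample
set.seed(2021)
ind <- sample(n, size = n * 0.8)
train <- boston_scaled[ind, ]
test  <- boston_scaled[-ind, ]
```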

3.0 Linear Discriminant Analysis

Linear Discriminant Analysis (LDA) is a classification (and dimension reduction) method. It finds the (linear) combinations of variables that best separate the classes of the target variable. Here the categorical crime rate is the target variable and all other variables in the data set are predictors.

#*Fit the linear discriminant analysis on the train set.
lda.fit<- lda(crime~., data= train)
# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2475248 0.2574257 0.2475248 0.2475248 
## 
## Group means:
##                  zn      indus         chas        nox         rm        age
## low       1.0481901 -0.8999506 -0.075474056 -0.8902065  0.4900900 -0.9153443
## med_low  -0.1262889 -0.3377335 -0.007331936 -0.5798875 -0.1193064 -0.3866087
## med_high -0.3911954  0.1820346  0.278864965  0.3718140  0.1327118  0.3536945
## high     -0.4872402  1.0149946 -0.075474056  1.0328559 -0.4003950  0.7953820
##                 dis        rad        tax     ptratio      black       lstat
## low       0.8635521 -0.6740820 -0.7630198 -0.50372334  0.3813311 -0.79309823
## med_low   0.3162265 -0.5600304 -0.5310782 -0.06473557  0.3138584 -0.17558069
## med_high -0.3479086 -0.3881138 -0.2773701 -0.23905156  0.1143675 -0.03947078
## high     -0.8536216  1.6596029  1.5294129  0.80577843 -0.7780627  0.85658648
##                 medv
## low       0.56193706
## med_low   0.04222498
## med_high  0.20345466
## high     -0.67617005
## 
## Coefficients of linear discriminants:
##                  LD1          LD2         LD3
## zn       0.114291802  0.944087713 -0.93598071
## indus    0.007086594 -0.152759693  0.08995595
## chas    -0.078333646 -0.074964293  0.02540312
## nox      0.293964917 -0.723780206 -1.41640860
## rm      -0.104662208 -0.072265096 -0.22235530
## age      0.306051337 -0.220673737 -0.15216638
## dis     -0.072801267 -0.464096956  0.09152711
## rad      3.130136502  1.414896695  0.06113764
## tax      0.066045636 -0.651626932  0.54442124
## ptratio  0.149142795 -0.005943911 -0.26843845
## black   -0.171116703  0.007730085  0.08667792
## lstat    0.161248387 -0.300776135  0.27727831
## medv     0.185410574 -0.441062037 -0.24103869
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9434 0.0426 0.0139

LDA determines the group means and computes, for each individual, the probability of belonging to each group of the target variable. Each observation is then assigned to the group with the highest probability score.
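As a sketch of this assignment rule, predict() on the fitted object returns the per-class posterior probabilities alongside the chosen class:

```r
# Each row of $posterior holds the probability of belonging to each group;
# $class is the group with the highest posterior probability.
pred <- predict(lda.fit, newdata = train)
head(round(pred$posterior, 3))
head(pred$class)
```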

Interpreting the results
Prior probabilities of groups: the proportion of training observations in each group. Such that:

  • 24.75% of training observations are in the low category
  • 25.74% in the med_low category
  • 24.75% in the med_high category
  • 24.75% in the high category.

Group means: represent the group center of gravity. It shows the mean of each variable in each group.
Coefficient of linear discriminant: Shows the linear combination of predictor variables that are used to form the LDA decision rule.
Proportion of trace: shows the variability explained by each linear discriminant, with LD1 the highest at 94.34%, LD2 at 4.26%, and LD3 the lowest at 1.39%.
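The proportion of trace printed above can be recomputed from the singular values stored in the fitted object; a quick sketch:

```r
# MASS::lda stores the singular values in $svd; their squared values,
# normalized to sum to one, give the proportion of trace per discriminant.
round(lda.fit$svd^2 / sum(lda.fit$svd^2), 4)
```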

3.1 Draw the LDA (bi)plot

An LDA biplot is designed to show how individuals and groups differ.

#create function for lda biplot arrows
lda.arrows<- function(x, myscale=1, arrow_heads=0.1, color="orange",
                      tex= 0.75, choices=c(1,2)){
  heads<- coef(x)
  arrows(x0 =0, y0=0,
         x1= myscale * heads[,choices[1]],
         y1= myscale* heads[,choices[2]], col = color, length=
           arrow_heads)
  text(myscale*heads[,choices], labels=row.names(heads),
                     cex= tex, col=color, pos=3)
}
#target classes as numeric 
classes<- as.numeric(train$crime)
#plot the lda results
plot(lda.fit, dimen=2, col=classes, pch= classes)
lda.arrows(lda.fit, myscale=1)

In LDA biplots the arrows represent the variables; longer arrows indicate stronger discrimination. In the plot above, the rad variable has the longest arrow, meaning it is the most influential variable in the model.

In addition, the variables nox and rad are not highly correlated, as the angle between their arrows is nearly a right angle. The variables nox and zn are negatively correlated.

3.2 Predict a class with the LDA model on the test data

lda.pred<- predict(lda.fit, newdata=test)
#*  cross tabulate the results with the crime categories from the test set.
table(correct= correct_classes, predicted= lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       11      16        0    0
##   med_low    3      15        4    0
##   med_high   1       6       19    0
##   high       0       0        1   26

The model predicted the high category almost perfectly (26 of 27 correct) and performed reasonably well for med_high (19 of 26) and med_low (15 of 22), but most low observations were misclassified, mainly as med_low (16 of 27).
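A sketch of summarizing the cross tabulation as a single accuracy figure (assuming correct_classes and lda.pred from above; the exact value varies with the random split):

```r
# Proportion of test observations classified correctly: the diagonal of the
# cross table divided by the total count.
tab <- table(correct = correct_classes, predicted = lda.pred$class)
sum(diag(tab)) / sum(tab)
# equivalently:
mean(correct_classes == lda.pred$class)
```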

4.0 K-Means clustering

Reload the Boston dataset and standardize

library(MASS)
data("Boston")
#center and standardize variables
Boston_scaled<- scale(Boston)
#summaries of the scaled variables
summary(Boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
str(boston_scaled)
## 'data.frame':    506 obs. of  14 variables:
##  $ zn     : num  0.285 -0.487 -0.487 -0.487 -0.487 ...
##  $ indus  : num  -1.287 -0.593 -0.593 -1.306 -1.306 ...
##  $ chas   : num  -0.272 -0.272 -0.272 -0.272 -0.272 ...
##  $ nox    : num  -0.144 -0.74 -0.74 -0.834 -0.834 ...
##  $ rm     : num  0.413 0.194 1.281 1.015 1.227 ...
##  $ age    : num  -0.12 0.367 -0.266 -0.809 -0.511 ...
##  $ dis    : num  0.14 0.557 0.557 1.077 1.077 ...
##  $ rad    : num  -0.982 -0.867 -0.867 -0.752 -0.752 ...
##  $ tax    : num  -0.666 -0.986 -0.986 -1.105 -1.105 ...
##  $ ptratio: num  -1.458 -0.303 -0.303 0.113 0.113 ...
##  $ black  : num  0.441 0.441 0.396 0.416 0.441 ...
##  $ lstat  : num  -1.074 -0.492 -1.208 -1.36 -1.025 ...
##  $ medv   : num  0.16 -0.101 1.323 1.182 1.486 ...
##  $ crime  : Factor w/ 4 levels "low","med_low",..: 1 1 1 1 1 1 2 2 2 2 ...

From the summary, all the variables have a mean of zero after standardization.

4.1 Calculate the distances between the observations.

4.1.1 Euclidean distance matrix
#euclidean distance matrix
dist_eu<- dist(Boston_scaled)
#summary of the distances
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
4.1.2 Manhattan distance matrix
#manhattan distance matrix
dist_man<- dist(Boston_scaled, method = 'manhattan')
#summary of the distances
summary(dist_man)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2662  8.4832 12.6090 13.5488 17.7568 48.8618
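The two metrics can be sketched by hand for a single pair of observations to see how they differ:

```r
# Distance between the first two (scaled) observations
x <- Boston_scaled[1, ]
y <- Boston_scaled[2, ]
sqrt(sum((x - y)^2))  # Euclidean: square root of summed squared differences
sum(abs(x - y))       # Manhattan: sum of absolute differences
```

The Manhattan distance is always at least as large as the Euclidean distance for the same pair, which matches the larger values in its summary above.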

4.2 k-means analysis

Run k-means clustering with an arbitrarily chosen number of clusters (here, three).

#K-means clustering
km<- kmeans(Boston_scaled, centers = 3)
#Plot the Boston_scaled dataset with clusters
pairs(Boston_scaled, col= km$cluster)

The pairs plot shows all variable pairs with the observations coloured according to their three cluster assignments.

4.3 Determine the optimal number of k means

To determine the optimal number of clusters, we look at how the total within-cluster sum of squares (TWCSS) behaves as the number of clusters changes. When the number of clusters is plotted against the TWCSS, the optimal number of clusters is where the TWCSS drops most sharply.

#set set.seed function to avoid the random assigning of initial clusters centers
set.seed(123) #to avoid random
#determine the number of clusters
k_max<-10
#calculate the total within sum of squares
twcss<- sapply(1:k_max, function(k)
  {kmeans(Boston_scaled, k)$tot.withinss})
#visualize the results
qplot(x = 1:k_max, y= twcss, geom= 'line')

From the plot, the sharpest drop occurs at two clusters, so two is taken as the optimal number of clusters.

Run k-means with the optimal number of centers (two).

#k-means clustering
km<- kmeans(Boston_scaled, centers= 2)
#plot the Boston dataset with clusters, vary the number of pairs
pairs(Boston_scaled,col = km$cluster)

From the plot, the two clusters are clearly distinct from each other, reflecting how k-means partitions the observations to minimize within-cluster variation.

Bonus:

  • Perform k-means on the original Boston data with some reasonable number of clusters (>2). Remember to standardize the data set.
  • Then perform LDA using the clusters as target classes. Include all the variables in the Boston data in the LDA model.
  • Visualize the results with a biplot (include arrows representing the relationship of the original variables to the LDA solution).
  • Interpret the results. Which variables are the most influential linear separators for the clusters?
1a. Load and standardize the original dataset.
#Reload the Boston dataset and standardize it
library(MASS)
data("Boston")

#center and standardize variables
boston_scaled2<- scale(Boston)
#summaries of the scaled variables
summary(boston_scaled2)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865

After scaling, all the variables have a mean of zero.

1b. Perform K-means using some reasonable number of clusters (>2).
#k-means clustering
km<- kmeans(boston_scaled2, centers= 5)
#plot the Boston dataset with clusters, vary the number of pairs
pairs(boston_scaled2,col = km$cluster)

The pairs plot shows the observations coloured according to their five cluster assignments.

1c. Add the clusters to the scaled Boston data set
#add the clusters into the boston dataset
boston_scaled2<- data.frame(boston_scaled2, km$cluster)
1d. Create training and test dataset from the scaled boston data
#number of rows in the Boston dataset
n<-nrow(boston_scaled2)
#choose randomly 80% of the rows
ind<- sample(n, size=n * 0.8)
#create train set
train2<- boston_scaled2[ind,]
#create test set
test2<- boston_scaled2[-ind,]

The original data was divided into 80% training and 20% testing sets.

1e. Perform LDA using the clusters as target classes, including all the variables in the Boston data in the LDA model.
#*Fit the linear discriminant analysis on the train set.
lda.fit2<- lda(km.cluster~., data= train2)
# print the lda.fit object
lda.fit2
## Call:
## lda(km.cluster ~ ., data = train2)
## 
## Prior probabilities of groups:
##          1          2          3          4          5 
## 0.07425743 0.13613861 0.23267327 0.20792079 0.34900990 
## 
## Group means:
##         crim         zn      indus       chas        nox         rm        age
## 1 -0.1854493 -0.3157316  0.3133786  3.6647712  0.4306404  0.2138181  0.4290913
## 2 -0.4138882  2.2993839 -1.1620998 -0.2007454 -1.1957952  0.6874068 -1.4644349
## 3  1.2068757 -0.4872402  1.0299433 -0.2723291  0.9969853 -0.4856256  0.7840759
## 4 -0.3322134 -0.4808597  0.5005827 -0.2723291  0.4126493 -0.5353672  0.6272386
## 5 -0.3982818 -0.1104688 -0.6999989 -0.2723291 -0.6224712  0.3392337 -0.4783739
##          dis          rad        tax    ptratio       black      lstat
## 1 -0.4433830  0.009638649 -0.0682569 -0.4290486  0.14108828 -0.1081627
## 2  1.6196570 -0.679093526 -0.5526750 -0.8218082  0.34273326 -0.9422548
## 3 -0.8263352  1.635167428  1.5322534  0.8052870 -0.80749211  0.9509017
## 4 -0.5482077 -0.585376631 -0.1979161  0.1772572  0.05114137  0.4788749
## 5  0.4272987 -0.554250499 -0.7413582 -0.3548817  0.37017778 -0.5635095
##         medv
## 1  0.5694394
## 2  0.7486460
## 3 -0.8219329
## 4 -0.4545371
## 5  0.4426882
## 
## Coefficients of linear discriminants:
##                  LD1          LD2           LD3         LD4
## crim     0.034626545  0.011867261 -0.1559040056  0.11802273
## zn       0.297786175 -0.287582562 -1.4864689354 -1.24084138
## indus   -0.054367812  0.370023831  0.1632260886 -0.40109753
## chas    -5.190618709 -0.398296510 -0.2982405218 -0.03794020
## nox      0.005563069 -0.287697635  0.0873598437 -0.49971441
## rm       0.055060927  0.015978997 -0.1245985638  0.09946824
## age     -0.123498535  0.189920047  0.4227915007 -0.28603098
## dis     -0.006874898 -0.462773844 -0.4034456808  0.33518750
## rad     -0.304219079  2.649304175 -1.0824093907  2.02998628
## tax     -0.017448586 -0.004645183 -0.6445853812 -1.10990584
## ptratio -0.016550862  0.029361089  0.1926290390 -0.61540473
## black   -0.002242608 -0.096184981  0.0315333992  0.09825891
## lstat    0.082731274  0.294123773 -0.0028274113 -0.26817140
## medv     0.146425853 -0.308358740 -0.0006178331 -0.03878442
## 
## Proportion of trace:
##    LD1    LD2    LD3    LD4 
## 0.6388 0.2588 0.0818 0.0206

Interpreting the results
Prior probabilities of groups: the proportion of training observations in each group. Such that:

  • 7.43% of training observations are in cluster 1
  • 13.61% in cluster 2
  • 23.27% in cluster 3
  • 20.79% in cluster 4
  • 34.90% in cluster 5

Group means: represent the group center of gravity. It shows the mean of each variable in each group.
Coefficient of linear discriminant: Shows the linear combination of predictor variables that are used to form the LDA decision rule.
Proportion of trace: shows the variability of each linear dimension as follows:

  • LD1 having the highest of the variability with 63.88%
  • LD2 having 25.88%
  • LD3 8.18%
  • LD4 2.06%

Draw the LDA (bi)plot

#create function for lda biplot arrows
lda.arrows<- function(x, myscale=1, arrow_heads=0.1, color="orange",
                      tex= 0.75, choices=c(1,2)){
  heads<- coef(x)
  arrows(x0 =0, y0=0,
         x1= myscale * heads[,choices[1]],
         y1= myscale* heads[,choices[2]], col = color, length=
           arrow_heads)
  text(myscale*heads[,choices], labels=row.names(heads),
       cex= tex, col=color, pos=3)
}
#target classes as numeric 
classes<- as.numeric(train2$km.cluster)
#plot the lda results
plot(lda.fit2, dimen=2, col=classes, pch= classes)
lda.arrows(lda.fit2, myscale=1)

From the plot, the chas variable has the longest arrow and is therefore the most influential variable, followed by rad. Moreover, chas and rad are not highly correlated, as the angle between their arrows is nearly a right angle.

Super-Bonus

model_predictors<- dplyr::select(train,-crime)
#check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
#matrix multiplication
matrix_product<- as.matrix(model_predictors)%*%
  lda.fit$scaling
matrix_product<-as.data.frame(matrix_product)

Access the plotly package to create a 3D plot of the columns of the matrix product.

  • Adjust the code: add color as an argument in the plot_ly() function.
  • Set the color to be the crime classes of the train set.
  • Draw another 3D plot where the color is defined by the clusters of the k-means.
  • How do the plots differ? Are there any similarities?
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
plot_ly(x= matrix_product$LD1, y= matrix_product$LD2, 
        z= matrix_product$LD3, type='scatter3d', mode= 'markers')
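A sketch of the requested coloured versions (assuming matrix_product and train from above; km in section 4 was fit on the full data, so a separate k-means on the training rows is needed for the second plot, and centers = 4 is an arbitrary choice):

```r
# Colour the points by the crime classes of the train set
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3,
        type = 'scatter3d', mode = 'markers', color = train$crime)

# Colour by k-means clusters computed on the same training rows
km_train <- kmeans(model_predictors, centers = 4)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3,
        type = 'scatter3d', mode = 'markers',
        color = factor(km_train$cluster))
```

If the k-means clusters align with the crime categories, the two plots will show similar colour groupings, though the cluster labels themselves are arbitrary.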
date()
## [1] "Sun Nov 28 15:01:58 2021"

(more chapters to be added similarly as we proceed with the course!)